#Import dataset
loanpredict <- read.csv("Training Data.csv", header = TRUE)

#Convert necessary variables to factors/categoricals and set appropriate level titles
loanpredict$Married.Single <- recode_factor(loanpredict$Married.Single, single = "Single", married = "Married")

loanpredict$House_Ownership <- recode_factor(loanpredict$House_Ownership, rented = "Renting", owned = "Owning", norent_noown = "Neither")

loanpredict$Car_Ownership <- recode_factor(loanpredict$Car_Ownership, no = "No", yes = "Yes")

#Remove variables that won't help in our analysis
loandata <- subset(loanpredict, select = -c(Id, CITY, STATE, Profession))

#Selecting only values where the customer defaulted
defaulted <- subset(loandata, Risk_Flag == 1)

#Selecting only values where the customer did not default
not_defaulted <- subset(loandata, Risk_Flag == 0)

#Import necessary library
library(corrplot)

#Subset out numerical variables for correlation analysis
numericalloan <- subset(loandata, select = c(Income, Age, Experience, CURRENT_JOB_YRS, CURRENT_HOUSE_YRS, Risk_Flag))

str(loandata)
sum(is.na(loandata$Age))
sum(is.na(loandata$Income))
sum(is.na(loandata$Experience))
sum(is.na(loandata$Married.Single))
sum(is.na(loandata$House_Ownership))
sum(is.na(loandata$Car_Ownership))
sum(is.na(loandata$CURRENT_HOUSE_YRS))
sum(is.na(loandata$CURRENT_JOB_YRS))
sum(is.na(loandata$Income))

outlierKD2(defaulted, Age, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

outlierKD2(defaulted, Income, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

outlierKD2(defaulted, Experience, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

outlierKD2(defaulted, CURRENT_JOB_YRS, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

outlierKD2(not_defaulted, Age, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

outlierKD2(not_defaulted, Income, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

outlierKD2(not_defaulted, Experience, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

outlierKD2(not_defaulted, CURRENT_JOB_YRS, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

#### Executive Summary:

We are analyzing consumer loan data in India, sourced from Kaggle. We are analyzing the relationship between several variables and loan default status including income, home ownership, job experience, etc. With the exception of income, all variables tested were significant.

Information Gathered:

All the information was collected at the time of the loan application, according to the Kaggle description. The author of the dataset doesn’t give more specifications about the dataset. The only context he provides is that the dataset belongs to the institution Univ Ai; it was used in their hackathon. The uniform distribution of the data and the limited features of the dataset provided can lead us to believe that the dataset may have been artificially created. We tried to contact the owner of the dataset, but no answer was given.

Dataset:

There is existing research on loan defaults and what variables tend to correlate with it in the United States. For our purposes, we wanted to understand if we would see similar dependencies in India for their consumer loan portfolios. In the US, income correlates highly with default status, where we utilize transformations such as debt-to-income ratios and won’t lend to individuals who don’t have the financial means to pay it back. Another common feature is that a way to determine creditworthiness (essentially a measure of the likelihood someone will pay back a loan) is if someone has trusted you with credit before (and not defaulted on it). In our dataset, we had two proxies for this: owning a home and owning a car, as both typically require consumer loans. Other variables that we know have impacts are employment status (not the job itself, but the existence of the job), age (maturity and income tend to go with age, which will impact responsibility and ability in paying back a loan), and job experience (again, a potential proxy for income). The years of homeownership can be considered a proxy for credit history, as if a loan is required for a home, we know the customer must have a credit history of at least that long, which in the US helps determine how much information on a customer exists. Knowing more information helps banks make more informed decisions, so they’re likelier to loan to customers with extensive history showing ability to pay off loans. This information also goes into the FICO score in the US, but it’s unclear if there’s an equivalent in India for the FICO score. There are two variables in the dataset that we know correlate with default status, but which are illegal in the United States to use in decisioning: age and marital status, as using them is considered discriminatory.

Using this knowledge, we developed questions designed to see if the Indian consumer loan market has similar correlates with default status as the United States market does. Knowing the income correlate, we asked if we would see a similar interaction in India. We wanted to see if homeownership correlated with default status, as in the US it’s an indicator of ability to make consistent payments against a loan. Likewise, we wanted to understand if time of homeownership correlated with decreases in likelihood of defaulting, which could be considered a proxy of credit history. We also wanted to see if age and marital status correlate in India, since we know they do in the US. One theory is that if they do correlate, we could build a better model than we’re allowed to build in the US, given those regulatory constraints. Therefore, we want to understand if younger customers default more than older customers. Given we said job experience can be used as a proxy, but isn’t problematic to use in the United States, we further thought it’d be interesting to compare age and job experience as predictors of default status. The theory here is that if job experience is a better predictor, then we wouldn’t need to use age in the model at all, while if age is the better predictor, then it’s interesting to note that the proxy performs worse, meaning in the US we may be making worse (albeit completely legal) loan decisions by using the proxy.

We would like to use this dataset to help us answer a few key questions related to defaulting on loans:

• Do customers who default on loans have statistically lower incomes than those who don't default?  
• Does home-ownership correlate with lower rates of default?  
• Does being married decrease the likelihood of default?  
• Does an additional year of home-ownership reduce the likelihood of default?  
• Does job experience or age show a larger impact on someone defaulting on their loan?  
    ○ Within that, do job experience and age both negatively correlate with loan defaults?  
• Are customers who default on loans younger than those who do not?  

In order to answer these questions, we need to analyze incomes, home-ownership status, marriage status, years in current house, job experience/experience, and age.

Limitations:

There is a variety of additional information that would have been beneficial to our model. Banks ask for a lot of information to give you a loan, and depending on the loan, they may solicit more information. It would have been ideal if our team had the loan amount and loan amount term. It’s important to know the longevity of the loan; potentially, some customers will default on long-term loans but will not default on short-term loans. Concerning the loan amount, it will have also been helpful to have it in our features. The more customer characteristic and loan characteristics will help our model be more accurate. The goal of the bank is to provide loans to the most qualifying customers. When a bank gives out a loan to a client, it ensures that the customer will be able to pay and reduce the chances of customers receiving a more significant loan than they can handle. There are responsibilities in both parts since the bank has the resources to build better models.

It would have been important if the dataset included the date of the default. Depending on the default date, there may be other factors to consider, such as country political situation, economic crisis, etc. Taking into consideration these parameters of the political environment and instability would be helpful for both the bank and the customer since there are uncontrollable scenarios that banks and customers can not control.

Other potentially valuable features would have been credit history, gender, number of dependents, education level, and type of loan. Intuitively, credit history would have been one of the most beneficial tools in our model; credit history is not included in the dataset, which is odd. In regards to the gender features, according to Durkeiner, M (2019), “An analysis of Experian credit data found that men are more likely to fall 60 days or more behind on their mortgage payments than women, and, in another consumer debt survey, men were responsible for more than 6 out of 10 defaults. Women tend to take on more debt in their lives overall, yet men still lead in defaults” (p. 2). We would have like to test this on our model, but unfortunately, we lack the features such as gender and delayed loan payments. All these features we mentioned would have been beneficial to explore more on what features affect people’s chance to default. Herron, J (2014) mentions the disparity between gender and loan defaults. “While men may seem to be more comfortable taking on more debt, they also get into financial trouble more often”(p.1).

The relationship between gender and loan default is complex and interesting; unfortunately, there is not much research on the topic.

Our dataset has some limitations that we need to call out. First, the dataset focuses on the employed population, so we have no comparison with the unemployed population. This means that we’re partially limited in the scope of the population to whom this analysis applies. Further, the available home ownership years, limits the population to more years of homeownership rather than less (10-14 years only). This further limits to whom our analysis applies. There may be some difficulties in this where the difference between 1 and 4 years of homeownership matter, but the difference between 10 and 14 doesn’t matter as much. That limits our ability to assess differences in homeownership. We see the same limitations with the job experience population.

Further limitations include omitted variables or data points we would want to have to better segment our population. The type of “consumer loans” assessed is unclear, which prevents us from creating more nuanced analysis. For example, if we knew the loans were housing loans, then we wouldn’t consider homeownership at all. Likewise, if these were car loans, we would ignore the car ownership variable. The terms on the loans might also matter, where if it’s a 30 year loan (similar to a US mortgage), we might expect to see more defaults given how much can change in a 30 year period, whereas for a 5 year loan, we wouldn’t expect to see as much fluctuation in the customer status from their initial loan application information. Likewise, we don’t know how long into the life of the loan the customer defaulted. This prevents our model from having a targeted time parameter. For example, in the United States, a lot of banking models rely on 18-24 month default rates as the target parameter. By not knowing the time frame, we can’t determine if a customer will default immediately or if they default many years down the line.

These limitations primarily prevent this analysis from applying to broad ranges of the population, but will still give us a limited understanding for employed Indian customers have lived in their homes for between 10 and 14. Additionally we don’t fully understand the Indian consumer loan market, so we don’t understand all of the possibly types of loans, how they assess loan decisions, or any nuances in who can or can’t receive loans at all. While it’s interesting for us to compare variables that influence the US and Indian loan default statuses, we recognize that the two countries and their financial systems are very different, and therefore the comparisons aren’t like for like, and there may be limited applicability back to the US markets.

Analyses:

#Create summary table of remaining variables
xkablesummary(loandata, title = "Summary Statistics for Loan Default Prediction")
Summary Statistics for Loan Default Prediction
Income Age Experience Married.Single House_Ownership Car_Ownership CURRENT_JOB_YRS CURRENT_HOUSE_YRS Risk_Flag
Min Min. : 10310 Min. :21 Min. : 0.0 Single :226272 Renting:231898 No :176000 Min. : 0.00 Min. :10 Min. :0.000
Q1 1st Qu.:2503015 1st Qu.:35 1st Qu.: 5.0 Married: 25728 Owning : 12918 Yes: 76000 1st Qu.: 3.00 1st Qu.:11 1st Qu.:0.000
Median Median :5000694 Median :50 Median :10.0 NA Neither: 7184 NA Median : 6.00 Median :12 Median :0.000
Mean Mean :4997117 Mean :50 Mean :10.1 NA NA NA Mean : 6.33 Mean :12 Mean :0.123
Q3 3rd Qu.:7477502 3rd Qu.:65 3rd Qu.:15.0 NA NA NA 3rd Qu.: 9.00 3rd Qu.:13 3rd Qu.:0.000
Max Max. :9999938 Max. :79 Max. :20.0 NA NA NA Max. :14.00 Max. :14 Max. :1.000

Examining our summary results, we can determine that Marriage Status, House Ownership, and Car Ownership are categorical variables of some interest to us. Income as quite a wide range of possible values with a minimum of 10310 Rupees and a maximum of 9999938 rupees. We’ll keep this data in the form of Rupees for analysis, but for context this minimum is approximately 134.03 USD and the maximum is approximately 1.310^{5} USD.

Let’s deep dive into the population of customers who have not defaulted on their loans.

#Summary for non-defaulted customers
xkablesummary(not_defaulted, title = "Summary of Data for Non-Defaulted Customers")
Summary of Data for Non-Defaulted Customers
Income Age Experience Married.Single House_Ownership Car_Ownership CURRENT_JOB_YRS CURRENT_HOUSE_YRS Risk_Flag
Min Min. : 10310 Min. :21.0 Min. : 0.0 Single :197912 Renting:202777 No :153439 Min. : 0.00 Min. :10 Min. :0
Q1 1st Qu.:2520633 1st Qu.:35.0 1st Qu.: 5.0 Married: 23092 Owning : 11758 Yes: 67565 1st Qu.: 4.00 1st Qu.:11 1st Qu.:0
Median Median :5002134 Median :50.0 Median :10.0 NA Neither: 6469 NA Median : 6.00 Median :12 Median :0
Mean Mean :5000449 Mean :50.1 Mean :10.2 NA NA NA Mean : 6.36 Mean :12 Mean :0
Q3 3rd Qu.:7470480 3rd Qu.:65.0 3rd Qu.:15.0 NA NA NA 3rd Qu.: 9.00 3rd Qu.:13 3rd Qu.:0
Max Max. :9999938 Max. :79.0 Max. :20.0 NA NA NA Max. :14.00 Max. :14 Max. :0

Looking at the summary data for non-defaulted customers, we can see that the majority of these customers rent their home, own a car, and are not married. Their average age is 50.093 years old and they have, on average 10.162 years of experience working. Further, these customers have an average income of 510^{6} Rupees. These customers also have an average of 6.357 years working at their current jobs, and have lived in their current houses (or apartments, condos, etc.) for an average of 12 years.

#Summary for defaulted customers
xkablesummary(defaulted, title = "Summary of Data for Defaulted Customers")
Summary of Data for Defaulted Customers
Income Age Experience Married.Single House_Ownership Car_Ownership CURRENT_JOB_YRS CURRENT_HOUSE_YRS Risk_Flag
Min Min. : 10675 Min. :21 Min. : 0.00 Single :28360 Renting:29121 No :22561 Min. : 0.00 Min. :10 Min. :1
Q1 1st Qu.:2421029 1st Qu.:33 1st Qu.: 4.00 Married: 2636 Owning : 1160 Yes: 8435 1st Qu.: 3.00 1st Qu.:11 1st Qu.:1
Median Median :4977653 Median :49 Median : 9.00 NA Neither: 715 NA Median : 6.00 Median :12 Median :1
Mean Mean :4973359 Mean :49 Mean : 9.53 NA NA NA Mean : 6.17 Mean :12 Mean :1
Q3 3rd Qu.:7556052 3rd Qu.:64 3rd Qu.:15.00 NA NA NA 3rd Qu.: 9.00 3rd Qu.:13 3rd Qu.:1
Max Max. :9994501 Max. :79 Max. :20.00 NA NA NA Max. :14.00 Max. :14 Max. :1

For the defaulted customers, we can see that they, like the non-defaulted population, primarily rent, own a car, and are not married. They have an average income of RESOLVE! 4.97310^{6} Rupees. These customers also have an average of 6.169 years working at their current jobs, and have lived in their current houses (or apartments, condos, etc.) for an average of 11.981 years. Their average age is 48.96 years old and they have, on average 9.531 years of experience working.

Let’s look at the correlation of our variables before we dig in deeper and use our assumptions that drove our questions.

#Run correlation
correlation = cor(numericalloan)

xkabledply(correlation)
#Make the plot look presentable
corrplot(correlation, method = 'circle', type = 'upper', 
         title = "Correlation Plot for Numerical Loan Variables",
         mar=c(0,0,2,0))

Examining the correlation plot, we see that Experience and # of years in the job are highly correlated, which is expected. No other variables appear to show strong correlations with each other or with risk flags. We can examine this further as we look into the difference between the the defaulted and non-defaulted populations.

Qualitative Data Tests

Chi-Square Test

Let’s conduct our chi-square test for if home-ownership rates differ between those who default and those who don’t. Although a dependency may seem apparent, because our loans are not explicitly home loans, but rather consumer loans, which include car loans, home loans, wedding loans, student loans, etc. we will still analyze the dependency or independency of home-ownership and laon defaults. We set up our home_ownership hypotheses as follows:

\(H_0\): Home Ownership Status and Loan Status are independent

\(H_1\): Home Ownership Status and Loan Status are dependent

As per usual, we will use \(\alpha\) = 0.05.

#contigeny table
contable_housing = table(loandata$House_Ownership, loandata$Risk_Flag)
xkabledply(contable_housing, title = 'Contigency table for Risk Flag vs House Ownership ')
Contigency table for Risk Flag vs House Ownership
0 1
Renting 202777 29121
Owning 11758 1160
Neither 6469 715
chitest_housing = chisq.test(contable_housing)

#To output results
chitest_housing
library(lmtest)
library(grid)
library(vcd)

#Graphical representation of chi-square plot
mosaic(~ Risk_Flag + House_Ownership ,
  direction = c("v", "h"),
  data = loandata,
  shade = TRUE
)

This Chi-squared test shows that pvalue = 1.83810^{-40} which means is statistically significant and indicates strong evidence against \(H_0\) therefore we adopt \(H_1\). After this test we reject our \(H_0\) and accept \(H_1\) Home Ownership Status and Loan Status are dependent. Knowing the value of one variables helps to predict the value of the other variable.

Next, we’ll assess if marital status has an impact on default status. We set up our marital status hypotheses as follows:

\(H_0\): Marital Status and Loan Status are independent

\(H_1\): Marital Status and Loan Status are dependent

As per usual, we will use \(\alpha\) = 0.05.

contable_marital = table(loandata$Married.Single, loandata$Risk_Flag)
xkabledply(contable_marital, title = 'Contigency table for Risk Flag vs Marital Status ')
Contigency table for Risk Flag vs Marital Status
0 1
Single 197912 28360
Married 23092 2636
chitest_marital = chisq.test(contable_marital)
chitest_marital
#Produces graphical representation of chi-square
mosaic(~ Risk_Flag + Married.Single ,
  direction = c("v", "h"),
  data = loandata,
  shade = TRUE
)

When we examine the Chi-squared test shows of Risk Flag vs Marital the p-value = 3.77310^{-26} is statistically significant and indicates strong evidence against \(H_0\) therefore we adopt \(H1_0\). After this test we are assuming Marital Status and Loan Status are not independent. Therefore knowing the value of one variable helps to predict the value of the other variable.

When we examine the Chi-squared test shows of Risk Flag vs Marital the p-value = 3.77310^{-26} is statistically significant and indicates strong evidence against \(H_0\) therefore we adopt \(H_1\). After this test we are assuming Home Ownership Status and Loan Status are not independent. Therefore knowing the value of one variable helps to predict the value of the other variable.

Quantitative Variables

Two Sample T-Test

Now we’ll move back to our quantitative variables.

First, let’s analyze the income levels and if they differ significantly for defaulted vs. non-defaulted customers. Given that we are trying to assess if customers who have defaulted have lower incomes than those who haven’t, we would set up our hypothesis test as follows:

\(H_0\): \(\mu_{defaulted} \geq \mu_{not defaulted}\)

\(H_1\): \(\mu_{defaulted} < \mu_{not defaulted}\)

As per usual, we will use \(\alpha\) = 0.05.

ttest2sample_incomes = t.test(defaulted$Income, not_defaulted$Income, alternative = 'less')
#Output results
ttest2sample_incomes

Here we fail to reject the Null Hypothesis \(H_0\). We do not have enough evidence that the average income of defaulted vs. non-defaulted customers is different. We fail to reject th enull hypothesis with a p-value of 0.0627, which is greater than \(\alpha\).

Now lets asses if years in current home is significantly different for those who have defaulted vs not defaulted. We set up our hypotheses for years in current home as follows:

\(H_0\): \(\mu_{defaulted} \geq \mu_{not defaulted}\)

\(H_1\): \(\mu_{defaulted} < \mu_{not defaulted}\)

As per usual, we will use \(\alpha\) = 0.05.

ttest2sample_currentHome = t.test(defaulted$CURRENT_HOUSE_YRS, not_defaulted$CURRENT_HOUSE_YRS, alternative = 'less')
#Output results
ttest2sample_currentHome

With a p-value of 0.0141, which is less than \(\alpha\), we can reject the Null Hypothesis \(H_0\) in favor of the alternative hypothesis \(H_1\). Therefore, at the \(\alpha\) = 0.05 level, we can say that the average number of years in the current house for someone who has defaulted is significantly less than for someone who did not default.

Now we’ll examine overall job experience and if it differs between the two loan statuses. We set up our hypotheses for overall job experience as follows:

\(H_0\): \(\mu_{defaulted} = \mu_{not defaulted}\)

\(H_1\): \(\mu_{defaulted} \neq \mu_{not defaulted}\)

As per usual, we will use \(\alpha\) = 0.05.

ttest2sample_experience <- t.test(defaulted$Experience, not_defaulted$Experience, alternative = "two.sided")
#Output results
ttest2sample_experience

Based on our t-test, we again reject the null hypothesis with a p-value of 8.91e-66 in favor of the alternative hypothesis \(H_1\). Therefore, we can say that at an \(\alpha\) = 0.05 level, the years of experience for someone who has defaulted differs significantly from the years of experience for someone who has not-defaulted.

We know overall job experience and years in current job are highly correlated, but it is still worth examining them separately even if we it is likely we’ll see the same result. Our hypotheses for years in current job are as follows:

\(H_0\): \(\mu_{defaulted} = \mu_{not defaulted}\)

\(H_1\): \(\mu_{defaulted} \neq \mu_{not defaulted}\)

As per usual, we will use \(\alpha\) = 0.05.

ttest2sample_currentjob <- t.test(defaulted$CURRENT_JOB_YRS, not_defaulted$CURRENT_JOB_YRS)
ttest2sample_currentjob

Our test again shows our p-value of 1.02e-16 is less than our \(\alpha\), so we can reject the null hypothesis in favor of the alternative hypothesis. Therefore, at the \(\alpha\) = 0.05 level, we can say that years in current job differs significantly between those who have defaulted and those who haven’t.

Lastly, we’ll examine age and if customers who default on loans are significantly younger than those who do not. We’ll set up our age hypotheses as follows:

\(H_0\): \(\mu_{defaulted} \geq \mu_{not defaulted}\)

\(H_1\): \(\mu_{defaulted} < \mu_{not defaulted}\)

As per usual, we will use \(\alpha\) = 0.05.

ttest2sample_age <- t.test(defaulted$Age, not_defaulted$Age, alternative = 'less')
ttest2sample_age

With our p-value of 2.27e-27, which is less than our \(\alpha\)=0.05, we can reject the null hypothesis in favor of the alternative. Therefore, the average age of those who have defaulted is statistically significantly lower than the average age of those who have not defaulted.

Summary

Few of the questions were based upon the assumptions from consumer loans data from the United States, where income, home ownership and job experience significantly impact the default status of a customer. With the data with consumer loans in India, we discovered that income did not have an impact on the default status, the income levels of both groups who defaulted and who did not, wasn’t significantly different. This made us explore other characteristics of the customers such as the correlation of home ownership, years in current home and years of job experience as a factor in default status. Our questions did not change but it did change the way we would look at all the other characteristics.

From the EDA, we discovered that there is not enough evidence to state that the customers who default on loans have lower incomes than those who do not. Our analysis showed the not-defaulted customers had an average income of 510^{6} and the average income of defaulted customers at: 4.97310^{6}. We further analyzed the average incomes of both the groups using a 2-Sample T-Test, where we failed to reject our Null Hypothesis and were not able to prove that the averages were significantly different.

Using Chi-Square Test, we were able to see that Home Ownership and Default status were dependent on each other. At p-value: 1.83810^{-40}, (with \(\alpha\) at 0.05), we were able to reject our Null Hypothesis that Home Ownership and Default status were independent and adopt the alternate that there was a significant dependence on each other.

Using Chi-Square Test, we were able to determine that the Martial Status and Default status were also dependent of each other. With a p-value: 3.77310^{-26}, we were able to reject our Null Hypothesis of these two characteristics being independent and adopted the alternative as we see a significant dependency on each other.

Although we can’t confirm if an additional year of home-ownership reduces the likelihood of default, we can say that years of home-ownership differs significantly between those who have defaulted and those who haven’t. Again using the 2 Sample T-test, we are able to state that the years of job experience significantly differs for customers who default vs who do not. At \(\alpha\) at 0.05, and p-value at: 8.91e-66, we were able to reject our Null Hypothesis and adopt the Alternate Hypothesis.

With regards to age, we did see a significant difference in age between the customers who defaulted vs who did not, the customer who defaulted were significantly younger than the ones who did not. Both the job experience and age, negatively correlate with the default status.

Appendix

References

Durkheimer, M. (2019, August 22). Millennial men more likely than women to default on student debt. Forbes. Retrieved November 5, 2021, from https://www.forbes.com/sites/michaeldurkheimer/2017/09/25/millennial-men-more-likely-than-women-to-default-on-student-debt/?sh=4d8ba5089ea8.

Herron, J. (2014, April 4). Men, women and debt: Does gender matter? Bankrate. Retrieved November 7, 2021, from https://www.bankrate.com/personal-finance/debt/men-women-and-debt-does-gender-matter/.

Surana, S. (2021, August 15). Loan Prediction Based on Customer Behavior | Kaggle. Kaggle. Retrieved November 8, 2021, from https://www.kaggle.com/subhamjain/loan-prediction-based-on-customer-behavior/discussion.

Boxplots

Income Boxplot

#Import necessary library
library("ggplot2")

#Convert Risk_Flag to factor level
loandata$Risk_Flag <- recode_factor(loandata$Risk_Flag, "0" = "Non-Defaulted", "1" = "Defaulted")

ggplot(loandata, aes(x=Risk_Flag, y=Income)) + 
  geom_boxplot( colour="orange", fill="#7777cc", outlier.colour="red", outlier.shape=8, outlier.size=4)+
  labs(title="Income based on Loan Default Status",x="Default Status", y = "Income")

Contigency Table

cont_table_housing <- table(loandata$House_Ownership, loandata$Risk_Flag)
xkabledply(cont_table_housing, title="Contingency table for Home-Ownership vs. Loan Default Status")
Contingency table for Home-Ownership vs. Loan Default Status
Non-Defaulted Defaulted
Renting 202777 29121
Owning 11758 1160
Neither 6469 715
cont_table_marriage <- table(loandata$Married.Single, loandata$Risk_Flag)
xkabledply(cont_table_marriage, title="Contingency table for Marriage Status vs. Loan Default Status")
Contingency table for Marriage Status vs. Loan Default Status
Non-Defaulted Defaulted
Single 197912 28360
Married 23092 2636

Years in Current Home Boxplot

ggplot(loandata, aes(x=Risk_Flag, y=CURRENT_HOUSE_YRS)) + 
  geom_boxplot( colour="orange", fill="#7777cc", outlier.colour="red", outlier.shape=8, outlier.size=4)+
  labs(title="Years in Current House based on Loan Default Status",x="Default Status", y = "Years in Current House")

Years in Current Job Boxplot

ggplot(loandata, aes(x=Risk_Flag, y=CURRENT_JOB_YRS)) + 
  geom_boxplot( colour="orange", fill="#7777cc", outlier.colour="red", outlier.shape=8, outlier.size=4)+
  labs(title="Years in Current Job based on Loan Default Status",x="Default Status", y = "Years in Current Job")

Job Experience Boxplot

ggplot(loandata, aes(x=Risk_Flag, y=Experience)) + 
  geom_boxplot( colour="orange", fill="#7777cc", outlier.colour="red", outlier.shape=8, outlier.size=4)+
  labs(title="Years Overall Job Experience based on Loan Default Status",x="Default Status", y = "Total Job Experience")

Age Boxplot

ggplot(loandata, aes(x=Risk_Flag, y=Age)) + 
  geom_boxplot( colour="orange", fill="#7777cc", outlier.colour="red", outlier.shape=8, outlier.size=4)+
  labs(title="Age based on Loan Default Status",x="Default Status", y = "Age")

Assesment non-defaulted population.

Analysis for Age
#str(loandata)


ggplot(data=not_defaulted, aes(Age)) + 
  geom_histogram(aes(y=..density..), binwidth = 2,
                 col="aquamarine4", 
                 fill="aquamarine2", 
                 alpha = .7) + 
  labs(title="Non-defaulted Customers Age Histogram") +
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(not_defaulted$Age, na.rm = TRUE), sd = sd(not_defaulted$Age, na.rm = TRUE))) + 
  labs(x="Age", y="Frequency") 

qqnorm(not_defaulted$Age, main="Q-Q plot of Age level of Not Defaulted Customers", col = 'seagreen3') 
qqline(not_defaulted$Age, col = 'red')

##shapiro.test(not_defaulted$Age) sample size to big to conduct a shapiro test, not necessary 
Analysis for Income
ggplot(data=not_defaulted, aes(Income)) +
  geom_histogram(aes(y=..density..),
                 col="dodgerblue4",
                 fill="dodgerblue2",
                 alpha = .7) +
  labs(title="Non-defaulted Customers Income Histogram") +
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(not_defaulted$Income, na.rm = TRUE), sd = sd(not_defaulted$Income, na.rm = TRUE))) +
  labs(x="Income", y="Frequency")

qqnorm(not_defaulted$Income, main="Q-Q plot of Income level of Not Defaulted Customers", col = 'seagreen3') 
qqline(not_defaulted$Income, col = 'red')

Analysis for Experience
#str(loandata)

ggplot(data=not_defaulted, aes(Experience)) + 
  geom_histogram(aes(y=..density..), binwidth = 1,
                 col="gold4", 
                 fill="gold2", 
                 alpha = .7) + 
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(not_defaulted$Experience), sd = sd(not_defaulted$Experience))) + 
  labs(title="Non-defaulted Job Experience Histogram") +
  labs(x="Experience", y="Frequency") 

qqnorm(not_defaulted$Experience, main="Q-Q plot of Job Experience of Not Defaulted Customers", col = 'seagreen3') 
qqline(not_defaulted$Experience, col = 'red')

Analysis for Current House Years
ggplot(data=not_defaulted, aes(CURRENT_HOUSE_YRS)) + 
  geom_histogram(aes(y=..density..), binwidth = 1,
                 col="brown4", 
                 fill="brown2", 
                 alpha = .7) + 
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(not_defaulted$CURRENT_HOUSE_YRS, na.rm = TRUE), sd = sd(not_defaulted$CURRENT_HOUSE_YRS, na.rm = TRUE))) +
  labs(title="Non-defaulted Customers Years in Current Home Histogram") +
  labs(x='Years in Current Home', y="Frequency") 

qqnorm(not_defaulted$CURRENT_HOUSE_YRS, main="Q-Q plot of Years in Current Home of Not Defaulted Customers", col = 'seagreen3') 
qqline(not_defaulted$CURRENT_HOUSE_YRS, col = 'red')

Analysis for Current Job Years
ggplot(data=not_defaulted, aes(CURRENT_JOB_YRS)) + 
  geom_histogram(aes(y=..density..), binwidth = 1,
                 col="burlywood4", 
                 fill="burlywood2", 
                 alpha = .7) + 
  labs(title="Non-defaulted Years in Current Job Histogram") +
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(not_defaulted$CURRENT_JOB_YRS, na.rm = TRUE), sd = sd(not_defaulted$CURRENT_JOB_YRS, na.rm = TRUE))) +
  labs(x="Years in Current Job", y="Frequency") 

qqnorm(not_defaulted$CURRENT_JOB_YRS, main="Q-Q plot of Years in Current Job Years of Not Defaulted Customers", col = 'seagreen3') 
qqline(not_defaulted$CURRENT_JOB_YRS, col = 'red')

Assesment defaulted population.

Analysis for Age
ggplot(data=defaulted, aes(Age)) + 
  geom_histogram(aes(y=..density..),
                 binwidth = 2,
                 col="darkolivegreen4", 
                 fill="darkolivegreen2", 
                 alpha = .7) + 
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(defaulted$Age, na.rm = TRUE), sd = sd(defaulted$Age, na.rm = TRUE))) + 
  labs(title="Defaulted Customers Age Histogram") +
  labs(x="Age", y="Frequency") 

qqnorm(defaulted$Age, main="Q-Q plot of Age Defaulted Customers", col = 'royalblue2' ) 
qqline(defaulted$Age, col = 'red')

Analysis for Income
ggplot(data=defaulted, aes(Income)) +
  geom_histogram(aes(y=..density..),
                 col="magenta4",
                 fill="magenta2",
                 alpha = .7) +
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(defaulted$Income, na.rm = TRUE), sd = sd(defaulted$Income, na.rm = TRUE))) +
  labs(title="Defaulted Customers Income Histogram") +
  labs(x="Income", y="Frequency")

qqnorm(defaulted$Income, main="Q-Q plot of Income of Defaulted Customers", col = 'royalblue2') 
qqline(defaulted$Income, col = 'red')

Analysis for Experience
#str(loandata)
ggplot(data=defaulted, aes(Experience)) + 
  geom_histogram(aes(y=..density..),
                 binwidth = 1,
                 col="darkorange4", 
                 fill="darkorange2", 
                 alpha = .7) + 
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(defaulted$Experience, na.rm = TRUE), sd = sd(defaulted$Experience, na.rm = TRUE))) +
  labs(title="Defaulted Experience Histogram") +
  labs(x="Experience", y="Frequency") 

qqnorm(defaulted$Experience, main="Q-Q plot of Years of Job Experience of Defaulted Customers", col = 'royalblue2') 
qqline(defaulted$Experience, col = 'red')

Analysis for Years in Current home
ggplot(data=defaulted, aes(CURRENT_HOUSE_YRS)) + 
  geom_histogram(aes(y=..density..),
                 binwidth = 1,
                 col="ivory4", 
                 fill="ivory2", 
                 alpha = .7) + 
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(defaulted$CURRENT_HOUSE_YRS, na.rm = TRUE), sd = sd(defaulted$CURRENT_HOUSE_YRS, na.rm = TRUE))) +
  labs(title="Defaulted Years in Current Home Histogram") +

  labs(x="Years in Current Home", y="Frequency") 

  labs(x="Current Home", y="Frequency") 


qqnorm(defaulted$CURRENT_HOUSE_YRS, main="Q-Q plot of Years in Current Home, for Defaulted Customers", col = 'royalblue2') 
qqline(defaulted$CURRENT_HOUSE_YRS, col = 'red')

Analysis for Years in Current Job
ggplot(data=defaulted, aes(CURRENT_JOB_YRS)) + 
  geom_histogram(aes(y=..density..),
                 binwidth = 1,
                 col="lightpink4", 
                 fill="lightpink2", 
                 alpha = .7) + 
  stat_function(fun=dnorm, lwd = 1, col = 'red', args = list(mean = mean(defaulted$CURRENT_JOB_YRS, na.rm = TRUE), sd = sd(defaulted$CURRENT_JOB_YRS, na.rm = TRUE))) +
  labs(title="Defaulted - Years in Current Job Histogram") +
  labs(x="Years in Current Job", y="Frequency") 

qqnorm(defaulted$CURRENT_JOB_YRS, main="Q-Q plot of Years in Current Job, for Defaulted Customers", col = 'royalblue2') 
qqline(defaulted$CURRENT_JOB_YRS, col = 'red')